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Abstract 



An extension of the synchronous parallel kinetic Monte Carlo (pkMC) algorithm 
developed by Martinez et al [J. Comp. Phys. 227 (2008) 3804] to discrete lattices 
is presented. The method solves the master equation synchronously by recourse to 
null events that keep all processors time clocks current in a global sense. Boundary 
conflicts are rigorously solved by adopting a chessboard decomposition into non- 
interacting sublattices. We find that the bias introduced by the spatial correlations 
attendant to the sublattice decomposition is within the standard deviation of the se- 
rial method, which confirms the statistical validity of the method. We have assessed 
the parallel efficiency of the method and find that our algorithm scales consistently 
with problem size and sublattice partition. We apply the method to the calculation 
of scale-dependent critical exponents in billion-atom 3D Ising systems, with very 
good agreement with state-of-the-art multispin simulations. 
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1 Introduction 



Kinetic Monte Carlo (kMC) pQ has proven an efficient and powerful tool to 
study non-equilibrium processes, and is used in fields as different as popula- 
tion dynamics, irradiation damage, or crystal growth [2|3] . The most widely 
used variant of the method is the Monte Carlo time residence algorithm [I] , 
also known as rejection-free n-fold method, or BKL in reference to its au- 
thors |5j. Although kMC is generally capable of advancing the time scale 
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significantly faster than direct, time-driven methods, it suffers from numeri- 
cal limitations such as stiffness [6] , and time asynchronicity. This has spurred 
the development of more powerful variants such as coarse-grained kMC [7J, 
first-passage kMC [S] , and other accelerated methods [H] . In this sense, a num- 
ber of parallelization schemes for kMC have been proposed, including rigorous 
and semi-rigorous algorithms based on asynchronous kinetics [T0fll|ll2j . These 
methods rely on cumbersome roll-back procedures to avoid causality errors, 
i.e. event time incompatibilities associated with processor communications. 
For this reason, most applications of interest are studied using approximate 
schemes (non-rigorous) for computational convenience. In spite of this, calcu- 
lations using asynchronous parallel kMC have provided numerous insights in 
several studies, most notably crystal growth [13J. 

Recently, we have developed and alternative algorithm based on a synchronous 
time decomposition of the master equation [Hj. Our parallel kinetic Monte 
Carlo method, eliminates time conflicts by recourse to null events that advance 
the internal clock of each processor in a synchronized fashion without altering 
the stochastic trajectory of the system. The method has been demonstrated 
for continuum diffusion/reaction systems, which represents a worst-case appli- 
cation scenario for two reasons. First, the maximum time step gain is limited 
by the intrinsic length scale of the problem at hand, which in concentrated 
systems may not be large; second, spatial boundary errors are difficult to elim- 
inate due to the unbounded nature of diffusion in a continuum setting. This 
latter feature also limits the parallel efficiency of the algorithm, as global com- 
munications are needed during every Monte Carlo step. In our synchronous 
parallel kMC method (spkMC), the parallel error can always be computed 
intrinsically and reduced arbitrarily (albeit at the expense of efficiency). In 
this paper, we extend spkMC to lattice systems, where diffusion lengths are 
quantized and boundary errors can be eliminated altogether. First, we adapt 
the algorithm proposed in Ref. [H] to discrete systems. Due to its relevance 
and well-known properties, we have chosen the three-dimensional (3D) Ising 
system as our underlying lattice model. Second, we analyze the performance of 
the method in terms of stochastic bias (error) and parallel efficiency. We then 
apply spkMC to large systems near the critical point, which provides a de- 
manding testbed for the method, as this is where fluctuations are exacerbated 
and convergence is most difficult. 



2 The 3D Ising model 



The Ising model is one of the most extensively studied lattice systems in 
statistical mechanics. It consists of a lattice of iV sites occupied by particles 
whose state is described by a variable <7j that represents the spin of each 
particle and can take only the value ±1. For pair- interactions, the Hamiltonian 



that gives the energy of configurational state o = {<jj} takes the form: 

where J is the coupling constant between neighboring pairs (i,j) and H is 
an external (magnetic) field. The time evolution of the system is assumed to 
be described by a master equation with Glauber dynamics [15], which states 
that the probability p(a, t) of finding the system in state a at time t obeys the 
equation: 

dp(a, t) 



Of 



-£[Wi(<j)p(a,t) -Wiia'Ma'it)} (2) 



where a 1 denotes the configuration obtained from a by flipping the i th spin 
with transition rate Wj(cr): 

Wi(a) = ^ [1 - (n tanh (2/3 AE f )] (3) 

Here A is a positive constant that represents the natural frequency of the 
system, (3 is the reciprocal temperature, and AE{ = —JJ2(i,j) Cj ~ Hai is the 
energy associated with spin i, which follows directly from eq. (JT|). In what 
follows we consider only internally-driven systems (H = 0). 

Many discrete systems can be mapped exactly or approximately to the Ising 
system. The grand canonical ensemble formulation of the lattice gas model, 
for example, can be mapped exactly to the canonical ensemble formulation 
of the Ising model. Also, binary alloy hamiltonians with nearest-neighbor in- 
teractions in rigid lattices can also be expressed as Ising hamiltonians. These 
mappings allow us to exploit results and behaviors of the Ising model to answer 
questions about the related models. In addition, the Ising system is particu- 
larly useful to study second-order phase transitions. The temperature T c at 
which such transitions occur is known as the critical temperature. During the 
phase transition, thermodynamic quantities diverge according to power laws 
of T, whose exponents are known as the critical exponents. The nature of 
the phase transition is determined by whether the order parameter is contin- 
uous at T c [16] . In a ferromagnetic system such as the Ising model, the order 
parameter is the net magnetization m(a): 

m (° r ) = ]^E^ ( 4 ) 

i 

For simple cubic lattices in 3D, N = L 3 is the number of sites, with L the 
lattice size. As we approach the critical temperature from T > T c , uncorrelated 
groups of spins align themselves in the same direction. These clusters grow in 
size, known as the correlation length £, which too diverges at the critical 
point. At T = T c , one may theoretically encounter arbitrarily large areas 



with correlated spins pointing in one direction. In finite systems the upper 
limit of £ is the system's dimension L. Thus, the challenge associated with 
simulating Ising systems during the phase transition is then ensuring that the 
error incurred by simulating a finite-size lattice is sufficiently small for the 
critical exponents to be calculated with certainty This has spurred a great 
many Monte Carlo simulations of very large lattices in the hope of finding 
converged critical exponents (cf., e.g., Ref. [T7]). 

When the system is in a ferromagnetic state, m decays from the spontaneous 
magnetization value mo with time asm k t~ K / uz , where k and v are the 
critical exponents for mo and £, respectively. From the known value of the 
ratio k/v = 0.515 [5T] in 3D, one can obtain z from the slope of the m-t curve, 
obtained for several L, at the critical point. To study the finite-size dependence 
of T c , high-order dimensionless ratios such as the Binder cumulant have been 
proposed: 

U< = M (5) 

which takes a value of C/4 ps 3 when T > T c (when the magnetization oscillates 
aggressively around zero), and goes to zero at low temperatures, when m = itlq. 
As mentioned earlier, at the critical point the correlation length diverges, and 
therefore C/4 does not depend on L. kMC equilibrium calculations of C/4 for 
several lattice sizes can then be used to calculate the value of the reciprocal 
critical temperature (3 C = J/kT c , whose most accurate estimate is presently 
P e = 0.2216546 [ISim 



3 Parallel algorithm for lattice systems 

3.1 General algorithm 



The basic structure of the algorithm is identical to that described in Ref. 
First, the entire configurational space is partitioned into K subdomains Qk- 
Note that, in principle, this decomposition need not be necessarily spatial (al- 
though this is the most common one), and partitions based on some other kind 
of load balancing can be equally adopted. However, without loss of generality, 
in what follows it is assumed that the system is spatially partitioned: 

1. A frequency line is constructed for each Clk as the aggregate of 
the individual rates, r^, corresponding to all the possible events 
within each subdomain: 



Rk — 2_^ T ik 



where nk and Rk are, respectively, the number of possible events and the 
total rate in each subdomain k. Here R to t = Ylk ^fc anc ^ N = Y,k n k- 

2. We define the maximum rate, R max , as: 

Rmax > max {R k } 

k=l,...K 

This value is then communicated globally to all processors. 

3. We assign a null event with rate r^ to each frequency line in 
each subdomain k such that: 

'Ok -t^max -ffe 

where, in general, the rok will all be different. We showed in Ref. [13] 
that the condition for maximum efficiency is that step (2) become strictly 
an equality, such that: 

3 Q a , a G {k}, | R a = R max -> r a0 = 

i.e. there is no possibility of null events. However, in principle, each sub- 
domain can have any arbitrary rok as long as all the frequency lines in 
each Qk sum to the same global value. This flexibility is one of the most 
important features of our algorithm. 

4. In each Qj, an event is chosen with probability p^ = rik/R max , 
including null events with p^o = Tko/R ma x- For this step, we must 
ensure that independent sequences of random numbers be produced for 
each Qk, using appropriate parallel pseudo random number generators. 

5. As in standard BKL, a time increment is sampled from an ex- 
ponential distribution: 

dt p = — 

-L^max 

where £ e (0, 1) is a suitable random number. Here, by virtue of 
Poisson statistics, St p becomes the global time step for all of the parallel 
processes. 

6. Communicate boundary events. A global call will always achieve 
communication of boundary information. However, depending on the 
characteristics of the problem at hand, local calls may suffice, typically 
enhancing performance. 



3.2 The sublattice method for solving boundary conflicts 



As we have shown, this algorithm solves the master equation exactly for non- 
interacting particles. When particles are allowed to interact across domain 



boundaries, suitable corrections must be implemented to avoid boundary con- 
flicts. For lattice-based kinetics with short-ranged interaction distances this 
is straightforwardly achieved by methods based on the chessboard sublattice 
technique. This spatial subdivison method has been used in multispin cal- 
culations of the kinetic Ising model since the early 1990s [20j21f22j . In the 
context of parallel kMC algorithms, Shim and Amar were the first to imple- 
ment such procedure [23], in which a sublattice decomposition was used to 
isolate interacting domains in each cycle. The minimum number of sublattices 
to ensure non-interacting adjacent domains depends on a number of factors, 
most notably system dimensionality 1 1 1. In 3D, the chessboard method requires 
a subdivison into a minimum of either two or eight sublattices, depending on 
whether only first or farther nearest neighbor interactions are considered. This 
is schematically shown in Figure HJ where each sublattice is defined by a spe- 
cific color. The Figure shows the minimum sublattice block (white wireframe) 
to be assigend to each processor. These blocks are indivisible and each pro- 
cessor can be assigned only integer multiples of them for accurate spkMC 
simulations. The implementation of the sublattice algorithm is as follows. Be- 





Fig. 1. Sublattice coloring scheme in three dimensions with regular subdivisions. 
Both two and eight color subdivisions are shown, corresponding to first and far- 
ther nearest neighbor sublattice interactions. The white wireframe indicates the 
indivisible color block assigned to processors 

cause here eq. [I] only involves first nearest neighbor interactions, the spatial 
decomposition performed in this work is such that it enables a regular sublat- 
tice construction with exactly two colors . In this fashion, each Qk becomes 
a subcell of a given sublattice, which imposes that each processor must have 
a multiple of two (or eight, for longer range interactions) number of subcells. 
Using a sublattice size greater than the particle interaction distance guaran- 
tees that no two particles in adjacent flk interact. Step (4) above might then 
be substituted by the following procedure: 



1 In 2D, four sublattices are sufficient to resolve any arbitrary mapping, as estab- 
lished by the solution to the 'four-color problem' 



4a. A given sublattice is chosen for all subdomains. This choice may be 
performed in several ways such as fully random or using some type of color 
permutation so that every sublattice is visited in each kMC cycle. Here 
we have implemented the former, as, for example, in the synchronous 
sublattice algorithm (SSL) of Shim and Amar [23]. The sublattice se- 
lection is performed with uniform probability thanks to the flexibility 
furnished by the spkMC algorithm, which takes advantage of the null 
rates to avoid global calls to communicate each sublattice's probability. 
Restricting each processor's sampling to only one lattice, however, while 
avoiding boundary conflicts, results in a systematic error associated with 
spatial correlations. The errors incurred by this procedure will be ana- 
lyzed in Section |U 

4b. An event is chosen in the selected sublattice with the appro- 
priate probability, including null events. When the rate changes in 
each 0,^ after a kMC cycle are unpredictable, a global communication 
of R m ax m step (2) is unavoidable. When the cost of global communica- 
tions becomes a considerable bottleneck in terms of parallel efficiency, it 
is worth considering other alternatives. For the Ising system, we consider 
the following: 



The simplest way to avoid global communications is to prescribe 
Rmax to a very large value so as to ensure that it is never surpassed 
regardless of the kinetics being simulated. For the Ising model, this 
amounts to calculating the maximum theoretical aggregate rate for 
an ensemble of Ising spins. For a given subdomain S7&, this is: 



R r 



Am 



exp(-AE max ) 
1 + exp(-AE max ) 



where E max is the theoretical maximum energy increment due to 
a single spin change: 



E~i 



-2(n b \J\-\H\) 



and n is the lattice coordination number. This procedure is very 
conservative and may result in a poor parallel performance. 



,,, ax . This procedure 
,,.,„ r by recording the 



Perform, a self-learning process to optimize R' 
is aimed at refining the upper estimate of R 
history of rate changes over the course of a pkMC simulation. For 
example, one can start with the maximum theoretical aggregate 
Ising rate and start decreasing the upper bound to improve the 
efficiency. For this procedure, a tolerance to ensure that R'max > 
Rmax must be prescribed. A sufficiently- long time history of this 
comparison must be stored to perform regular checks and ensure 
that the inequality holds. 



This algorithm solves the same master equation as the serial case but it is not 
strictly rigorous. As we have pointed out, the sampling strategies adopted to 
solve boundary conflicts introduce spatial correlations that result in stochastic 
bias. Under certain conditions spkMC does behave rigorously in the sense that 
this bias is smaller than the intrinsic statistical error. We explore these issues 
in the following section. 



4 Stochastic bias and analysis of errors 



The algorithm introduced above eliminates the occurrence of boundary con- 
flicts at the expense of limiting the sampling configurational space of the 
system in each kMC step. Because boundary conflicts are inherently a spatial 
process, this introduces a spatial bias that must be quantified to understand 
the statistical validity of the spkMC results. Next, we analyze this bias by test- 
ing the behavior of the magnetization when the system is close to the critical 
point (<r c ). All the results shown in this section correspond to the sublattice 
algorithm using two colors with random selection. 

The bias is defined as the difference between a parallel calculation and a 
reference calculation usually taken as the mean of a sufficient number of serial 
runs Li|: 

bias = (m(a c )) p - (m{a c )) s (6) 

where {m(a c )) p and {m(a c )) s are the averages of a number of independent runs 
in parallel and in serial, respectively, for a given total Ising system size. The 
initial (m(t = 0) = mo) and boundary (periodic) conditions in both cases are 
identical. In Figure [2] we show the time evolution of the magnetization of an 
A r =262,144-spin (2 18 ) Ising system, averaged over 20 serial runs, used as the 
reference for the calculation of the bias. The purple shaded region gives the 
extent of the standard deviation, which is initially very small, when m is very 
close to one, but grows with time as the system approaches its paramagnetic 
state and fluctuations are magnified. The shaded region (in gold) between 
10~ 4 < t < 5xlO _3 A _1 has been chosen for convenience and marks the time 
interval over which eq. [6] is solved 3 1. The same exercise has been repeated for 
a 2,097,152-spin (2 21 ) sample with 5 serial runs performed (not shown). 

Figure [3] shows the time evolution of the bias for a number of parallel runs 
corresonding to the two system sizes studied. The shaded area in the figure 
corresponds to the interval contained within the standard deviation of the 
serial case (cf. Fig. [2]). Therefore, this analysis yields the maximum number 



2 If available, an analytical solution may of course be used as a reference as well. 

3 And where the critical exponent in Section [6] is measured. 
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Fig. 2. Time evolution of the magnetization of a 262,144-spin Ising system at the 
critical temperature. The curve is the result of 20 independent serial kMC runs. 
The standard deviation, a s , is represented by the purple shaded area about the 
magnetization curve. The golden shaded area marks the region over which the bias 
is computed. 

of parallel events that can be considered to obtain a solution statistically 
equivalent to that given by the serial case. 

The figure shows up to what number of parallel processes can the serial and 
sublattice methods be considered statistically equivalent in the entire range 
where the bias is calculated. For the 262,144-spin system this is 32, whereas 
for the 2,097,152 one it is approximately 256. However, runs whose errors 
are larger than the serial standard deviation at short time scales (e.g. >64 
and >512 for, respectively, the 262,144 and 2,097,152-spin systems) gradually 
reduce their bias as time progresses. In fact, at £ > 2 x 10~ 3 A -1 , all parallel 
runs fall within a s . It appears, therefore, that fluctuations play an important a 
role in the parallel runs for low numbers of processes an spkMC cycles. As the 
accumulated statistics increases (more cycles), this effect gradually disappears. 
In any event, the bias is never larger than ^2% for all cases considered here. 

Although Fig. [3]) provides an informative quantification of the errors intro- 
duced by the parallel method, it is also important to separate this systematic 
bias from the statistical errors associated with each set of independent parallel 
runs. This is quantified by the standard deviation of the time-integrated bias, 
defined as: 

°t = \hl + ol (7) 
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Fig. 3. Time evolution of the parallel bias for two Ising system sizes at the critical 
point. Results for parallel runs with several numbers of processors are shown. The 
shaded region corresponds to the interval contained within the standard deviation 
of the reference (serial) case. Note the logarithmic scale for the abscissa. 

where the terms inside the square root are the parallel and serial variances 
respectively. We next solve eqs. (EJ) and (GO) during the time interval prescribed 
above for the two system sizes considered in Fig. [3j The absolute value of the 
systematic bias is extracted from a number of independent runs (10 and 5 
respectively) and plotted in Figure H] as a function of the number of parallel 
processes. Note that the number of parallel processes is equal to the total 
number of subcells divided by the number of different sublattices (=2, in our 
case). 

The figure shows that the absolute value of the bias is always smaller than the 
statistical error (i.e. the error bars always encompass zero bias). This implies 
that, in the range explored, a given problem may be solved in parallel and the 
result obtained can be considered statistically equivalent to a serial run. The 
bias is roughly constant and always below 0.5% in the entire range explored for 
both cases. However, the bias is consistently lower for the larger system size, 
as are the error bars. This is simply related to the moderation of fluctuations 
with system size. 

An analysis such as that shown in Figure H] allows the user to control the par- 
allel error by choosing the problem size and the desired number of particles 
per subcell. Consequently, our method continues to be a controlled approxima- 
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Fig. 4. Systematic parallel bias for 262,144 and 2,097, 152-spin 3D Ising systems 
(obtained, respectively, from 10 and 5 independent runs) as a function of the number 
of parallel processes. Note that the number of parallel processes is equal to the total 
number of subcells divided by the number of different sublattices (=2, in our case). 

tion in the sense that the error can be intrinsically computed and arbitrarily 
reduced. 



5 Algorithm performance 



The algorithm's performance can be assessed via its two fundamental con- 
tributions, namely, one that is directly related to the implementation of the 
minimal process method (MPM) through the null events [25], and the parallel 
performance per se. The effect of the null events is quantified by the utilization 
ratio (UR): 

UR = i - Jp|I°L ( 8 ) 

which gives the relative weight of null events on the overall frequency line. The 
UR determines the true time step gain associated with the implementation of 
the MPM as [H]: 

5t* = K ■ UR ■ St, 

where St* and St s are, respectively, the MPM and standard time steps. This 
procedure is intrinsically serial, and will result in superlinear scalar behavior 
if not taken into account for parallel performance purposes. Next, we show 



11 



in Figure E the evolution of the UR for 524,288 (2 19 ) and 1,048,576 (2 20 ) 
spin systems. We have done calculations for several numbers of processors 
and number of particles per subcell. We find that the determining parame- 
ter is the latter, i.e. for a fixed system size and number of processors used, 
the UR displays a strong dependence with the number of particles per sub- 
cell. The figure shows results for 512 and 4096 particles per subcell, which in 
the 524,288(l,048,576)-spin system amounts to, respectively, 1024(2048) and 
128(256) subcells per processor. In the latter case, the UR eventually oscillates 
around ~ 82%, whereas in the former it is approximately 90% (i.e. on average, 
~ 18% and 10% of events, respectively, are null events). 
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Fig. 5. Evolution of the utilization ratio (UR) with simulated time for 524,288 (2 19 ) 
and 1,048,576 (2 20 ) particle Ising system. Calculations have been done varying the 
number of sublattices per processor, which is shown to have a significant impact on 
the UR. 

For its part, the parallel efficiency is defined as the wall clock time employed 
in a serial calculation relative to the wall clock time of a parallel calculation 
with K processors involving a f^-fold increase in the problem size: 



V 



ts(l) 

t p (K) 



(9) 



The inverse of the efficiency gives the weak-scaling behavior of the algorithm. 
Due to the absence of fluctuations that exist in other parallel algorithms based 
on intrinsically asynchronous kinetics (cf., e.g., Ref. [23]), the ideal parallel 
efficiency of pkMC is always 100%. 
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Let us now consider the efficiency for the following weak-scaling problem. As- 
suming that frequency line searches scale linearly with the number of walkers 
in our system, the serial time expended in simulating a system of N spins to 
a total time T is: 

t s (l)=n s (t exe + 0(N)) 

where n s is the number of cycles required to reach T, and t exe is the computa- 
tion time during each kMC cycle. For its part, the total parallel time for the 
i^-fold system is: 

t p {K) = n p (t exe + 0(N)+ t c ) 

where n p is the counterpart of n s and t c = t g + U is the communications 

overhead due to global and local calls. In the most general case, the efficiency 

is then: 

_ n s (U + Q{N)) 

1 n p (t exe + 0(N)+t c ) { } 

As mentioned in the paragraph above, when it is ensured that the serial al- 
gorithm also take advantage of the time step gain furnished by the minimal 
process method, the number of cycles to reach T is the same in both cases, 
n s = n p . The parallel efficiency then becomes: 

v = t~ + °W (11) 

t eX£ + 0(N) + tg + h V ' 

Next, by virtue of the logP model [26], we assume that the cost of global 
communications is O (log if) , with b a constant, while the local communica- 
tion time, ti, is independent of the number of processors used and scales with 
the problem size as \/N [27]. If we consider the execution time t exe negligible 
compared to the communication time, we have: 

C ° N l (12) 



1 c N + Cl (\ogK) b + c 2 ^N " (ci/cqN) (log if) 6 + (l + c 2 /c^ 

where cq, c.\ and c 2 are architecture-dependent constants The final expression 
then reduces to: 

77 = n ^o. (13) 

a (log K ) + c 

where a and c are an architecture and problem dependent constants. 

In the case where R ma x is overdimensioned a priori to a prescribed tolerance 
TOL of the true value, R* max ~ -Rat(i + TOL), then t g = and eg. [HH becomes: 

t exe + 0(N) 



[1 + TOL) (t^ + 0(N)+t t ) 



stemming from the fact that now the ratio n s /n p = 5t p /St s = R max /R^ nax . 
Assuming again that t exe is negligible with respect to ti and the 0(N) term 
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for frequency line searches, the expression for the efficiency takes the form: 

cN - l 

V ~ (1 + TOL) (ciV + £ ? ) ~ c(l + TOL) (15) 

where c is the same as in eq. [131 

Combining eqs. (!T3|) and ( TT5|) . we arrive at the criterion to choose the optimum 
algorithm: 

TOL< -(login 6 
c 

i.e. as long as the above inequality is satisfied, avoiding global calls by con- 
servatively setting R max at the beginning of the simulation results in a more 
efficient use of parallel resources. Note that, via the constants a, b, and c, this 
is problem and machine-dependent, and establishing these with confidence 
may require considerable testing prior to engaging in production runs. 

Next, we perform scalability tests for the case where R ma x is communicated 
globally, i. e. the efficiency is governed by eq. [131 The tests have been carried 
out on LLNL's distributed-memory parallel platforms, specifically the "hera" 
cluster using Intel compilers [28]. The scalabilitry calculations were all per- 
formed for 512 particles per subcell, regardless of the number of processors 
used, for systems with three different numbers of spins per processor, namely 
4,194,304 (2 21 ), 2,097,152 (2 20 ), and 1,048,576 (2 19 ). This means that as the 
number of particles per processor is increased, more subcells are assigend to 
each processor. Figure [6] shows the parallel efficiency of the pkMC algorithm 
as a function of the number of processors used for three reference Ising sys- 
tems at the critical point. The fitting constants a, b and c are given for each 
case. As the figure shows, the number of spins per processor has a significant 
impact on the parallel efficiency, with larger sizes resulting in better perfor- 
mances. The efficiency at K = 256 is upwards of 80% for the largest system, 
and ~ 60% for the smallest system size. The leap in efficiency observed in all 
cases between 2 and 4 processors is caused by the nodal interconnects (band 
width) connecting quad cores in the platforms used. As expected from eq. [12j 
the fitting constant a scales roughly as iV _1 , while b does not display a large 
variability and takes value between 0.41 and 0.49. c can be considered equal 
to one for all three cases within the least-squares error, which implies negligi- 
ble local communication costs (c-i ~ in eq. |T2"|) . and simplifies the tolerance 
criterion above to TOL < a(\ogK) b . 

The idea behind using a tolerance to minimize or contain global calls forms 
the basis of the so-called optimistic algorithms [2"9~f3"U] . where the parameter(s) 
controlling the parallel evolution of the simulation are set conservatively - 
either by a self-learning procedure or by accepting some degree of error — 
and monitored sparingly. For example, for the 2 20 -spin Ising system, a = 0.11, 
b = 0.41, c = 1.00, TOL varies between <0.09 and <0.22 in the range 2 < K < 
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Fig. 6. Parallel efficiency for three different weak-scaling problems, one with 2 20 , 
one with 2 21 , and another with 2 22 particles per processor of an Ising system at the 
critical point. The fitting constants a, b and c in eq. [13] are given for each case. 

256. A value of 0.09 may not be sufficient to encompass the time fluctuations 
in Rmax, but it is expected that for a higher number of processors the efficiency 
will improve, although at the cost of the UR. These and more aspects about 
the parallel efficiency and its behavior will be discussed in Section [7J 



6 Application: billion-atom Ising systems at the critical point 



We now apply the method to study the time relaxation of large Ising systems 
near the critical point. As anticipated in Section &\ at the critical point, the 
relaxation time r diverges as £ z , where £ oc \T — T c \~ u . The scaling at T = T c 
is then: 

m(t) oc r /w (16) 

In 3D, we use the known critical temperature J/kT c = 0.2216546 [T8][T9~] to 
find z. We start with all spins +1 and let m(t) decay from its initial value of 
unity down to zero. At each time point, we can find the critical exponent z 
from: 



fl 



d(\o£ 



m) 



(17) 



d(\ogt) 

where the ratio /3/u takes the known value of 0.515 [31] . We have carried 
out simulations with lattices containing 1024x512x512 (2 28 ), 1024x1024x512 
(2 29 ), and 1024x1024x1024 (2 30 ) spins. The results are shown in Figure [7J 
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for critical exponents calculated during t > 0.025A 1 , from time derivatives 
averaged over 300 to 500 timsteps. 
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Fig. 7. Effective dynamical critical exponent as a function of inverse time using 1024 
processors and 1,048,576 spins per subcell for approximately a quarter (2 28 ), half 
(2 29 ), and one (2 30 ) billion-spin Ising systems for t > 0.025A -1 . The horizontal line 
at z = 2.04 marks the consensus value in 3D from the literature. 

At long time scales, the critical exponent oscillates around values that range 
from, roughly, 2.06 to 2.10 depending on system size. This in good agreement 
with the converged consensus value of ~2.04 published in the literature [32] 
(shown for reference in Fig. [7]). However, as time increases, the oscillations 
increase their amplitude with inverse system size. Oscillations of this nature 
also appear in multi-spin calculations, both for smaller [33. 34"f35] and larger 
[36] systems, where the inverse proportionality with system size is also ob- 
served. These may be caused by insufficient statistics due to size limitations, 
as we have shown that, under the conditions chosen for the simulations, our 
calculations are statistically equivelent to serial ones (cf. Fig. SJ). The effect of 
the system size is also clearly manifested in the relative convergence rate of 
z. As size increases, convergence to the expected value of 2.04 is achieved on 
much shorter time scales, i.e. fewer kMC cycles. 



7 Discussion 



We now discuss the main characteristics of our method. We start by consid- 
ering the three factors that affect the performance of our algorithm: 
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(i) Number of particles per subcell. This is the most important variable af- 
fecting the algorithm's performance, as it controls the intrinsic parallel 
bias and the utiliation ratio. Higher numbers of spins per subcell both 
reduce the bias (cf. Figs. E]and|I|) and increase the UR (Fig. EJ), bolster- 
ing performance. However, this also results in an increase of the value 
of Rmax, which causes a reduction in 5t p . Thus, decreasing the bias and 
increasing the time step are actions that may work in opposite directions 
in terms of performance, and a suitable balance between both should be 
found for each class of problems. 

(ii) Number of particles per processor. This parameter affects the parallel 
efficiency via the number of spins per processor Nk (for regular space 
decompositions, KNk ~ N). As Nk increases, a significant improvement 
is observed. This is directly related to the parameter a in eq. [13j which 
scales inversely with N^ and is related to the cost of linear searches. 

(iii) Total system size. As Figs. [3] and H] show, for a given sublattice decom- 
position, a larger system incurs in smaller relative fluctuations in the 
magnetization, which results in a more contained bias. 



Through the constants a, b, and c, the parallel efficiency strongly depends 
on the latency and bandwidth of the communication network used. That is 
why we have explored other efficiency-increasing alternatives that contain the 
number of global calls and the associated overhead. Prescribing a tolerance on 
the expected fluctuations of R max is in the spirit of so-called optimistic kMC 
methods, and ideally its value is set by way of a self-learning procedure that 
maximizes the efficiency. 



In any case, the intersection of items (i)-(iii) above configures the operational 
space that determines the class of problems that our method is best suited 
for: large (multimillion) systems, with preferrably a sublattice division that 
achieves an optimum compromise between time step gain and lowest possible 
bias, with the maximum possible number of particles per processor. These 
are precisely the conditions under which we have simulated critical dynamics 
of 3D Ising systems, with very good results. We conclude that spkMC is best 
designed to study this class of dynamic problems where fluctuations are impor- 
tant and there is unequivocal size scaling. This includes applications on many 
other areas of physics, such as crystal growth, irradiation damage, plasticity, 
biological systems, etc., although other difficulties due to the distinctiveness of 
each problem may arise that may not be directly treatable with the algorithm 
presented here. We note that, because the parallel bias is seen to saturate 
for large numbers of parallel processes, and the efficiency is governed by the 
inverse logarithmic term, the only limitation to using spkMC is given by the 
number of available processors. 
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8 Conclusions 



We have developed an extension of the synchronous parallel kMC algorithm 
presented in Ref. [TJ] to discrete lattices. We use the chess sublattice technique 
to rigorously account for boundary conflicts, and have quantified the resulting 
spatial bias. The algorithm displays a robust scaling, governed by the global 
communications cost as well as by the spatial decomposition adopted. We have 
applied the method to multimillion-atom three-dimensional Ising systems close 
to the critical point, with very good agreement with published state-of-the-art 
results. 
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